Note: I have only done the EDA needed to answer the questions asked. I have not done any EDA for the purpose of feature engineering or feature selection.
So, no missing data. Yayyy!
Before exploring the relationship of the predictors with the target, let's first clearly define the target.
Creditworthiness for a group of observations can be measured by the Good/Total proportion: the higher the proportion, the higher the creditworthiness.
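As a minimal base-R sketch of this definition (toy rows standing in for the real credit table; the column names `credit_history` and `is_credit_worthy` are taken from the feature list used later in this analysis):

```r
# Creditworthiness of a group = share of "Good" observations in it.
# Toy stand-in for the credit data; the real analysis runs this on the full dataset.
credit <- data.frame(
  credit_history   = c("A30", "A30", "A34", "A34", "A34"),
  is_credit_worthy = c("Bad", "Good", "Good", "Good", "Bad")
)

# Good/Total proportion per group
worthiness <- tapply(credit$is_credit_worthy == "Good", credit$credit_history, mean)
worthiness
```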
Question: Would a person with a critical credit history be more creditworthy?
Again, let’s first define what ‘critical’ means. In the absence of any concrete definition, I will assume ‘critical’ roughly means more existing credits, i.e. criticality increases from A30 to A34.
A critical credit history has a positive association with creditworthiness.
Q. Are young people more creditworthy?
The distributions overlap considerably, but there are more young people in “Bad” than in “Good”, and that is also visible in the difference in means. So, young people seem slightly less creditworthy.
But let’s break age into groups to see finer details.
“Bad” is quite low for the (34, 39] age group
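The age binning can be sketched in base R with `cut()` (toy ages and outcomes; the bin edges are my assumption, chosen so that one bin is the (34, 39] group mentioned above):

```r
# Bin ages into groups and compute the Good share per bin (toy data)
ages <- c(22, 25, 31, 36, 38, 45, 52)
good <- c(FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE)

age_group <- cut(ages, breaks = c(18, 24, 29, 34, 39, 49, 75))
good_share <- tapply(good, age_group, mean)
good_share
```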
Q. Would a person with more credit accounts be more creditworthy?
I am assuming “more credit accounts” is the same as “Number of existing credits at this bank”, i.e. ‘count_existing_credits’.
The data is too unreliable to say anything about the relationship between the number of credit accounts and creditworthiness.
Consequently, there is no feature engineering.
For feature selection I have used Boruta, which I have found to be one of the best feature-selection techniques almost always. Below is how the Boruta plot looks:
Selected features are:
[1] "checking_account_status"
[2] "duration_in_months"
[3] "credit_history"
[4] "purpose"
[5] "credit_amount"
[6] "savings_account_status"
[7] "present_employment_since"
[8] "installment_as_percent_of_income"
[9] "role_in_other_credits"
[10] "assset_type"
[11] "age"
[12] "other_installment_plans"
[13] "housing_type"
[14] "employment_type"
[15] "is_credit_worthy"
It is worse to class a customer as ‘Good’ when they are ‘Bad’ than it is to class a customer as ‘Bad’ when they are ‘Good’.
Let ‘Good’ be the positive class, and ‘Bad’ be the negative class. So the above statement will translate to:
> False Positives (FPs) are more expensive than False Negatives (FNs)
Such cases fall under the **Cost-Sensitive Learning** strategy, and the following sub-strategies can be adopted under it:
I will try the following three models:
- Logistic Regression
- Boosted Trees: GBM
- Random Forest
I will go with a custom evaluation metric. I have assigned the following weights to the different buckets of the confusion matrix, to penalize each bucket differently:
            Reference
Prediction   Good   Bad
      Good   -0.4     1
      Bad     0.2     0
There is no particular reason for these exact values; only their relative differences matter, because they penalize FPs more than FNs. Plus, I am rewarding TPs (True Positives).
Now, the custom metric is just the normalized sum-product of these weights and the confusion matrix of the model. Let’s call it “credit_cost”.
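A minimal base-R sketch of this metric, using the weight table above. As a worked example, applying it to the baseline “predict everyone Good” confusion matrix on a train set of 561 Good / 241 Bad observations (the split that appears in the H2O output) reproduces the baseline train cost reported below:

```r
# Weights: rows = prediction, cols = reference (same layout as the table above)
weights <- matrix(c(-0.4, 1,
                     0.2, 0),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Prediction = c("Good", "Bad"),
                                  Reference  = c("Good", "Bad")))

# credit_cost = normalized sum-product of the weights and a confusion matrix
credit_cost <- function(conf_mat) sum(weights * conf_mat) / sum(conf_mat)

# Baseline: classify everybody as "Good" (561 true Good, 241 true Bad)
baseline_cm <- matrix(c(561, 241,
                          0,   0),
                      nrow = 2, byrow = TRUE,
                      dimnames = dimnames(weights))
credit_cost(baseline_cm)  # (-0.4*561 + 1*241) / 802 = 16.6/802 ≈ 0.0207
```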
I have used an 80:20 train-test split. For validation, I will use cross-validation wherever required.
The baseline is to predict everybody as “Good”.
Train credit_cost
Baseline Train Cost: 0.0206982543640898
Baseline Train Precision: 0.699501246882793
Test credit_cost
Baseline Test Cost: 0.0171717171717172
Baseline Test Precision: 0.702020202020202
Logistic Regression Train Results:
Confusion Matrix and Statistics
Reference
Prediction Good Bad
Good 518 116
Bad 43 125
Accuracy : 0.802
95% CI : (0.772, 0.829)
No Information Rate : 0.7
P-Value [Acc > NIR] : 0.0000000000343
Kappa : 0.484
Mcnemar's Test P-Value : 0.0000000112995
Sensitivity : 0.923
Specificity : 0.519
Pos Pred Value : 0.817
Neg Pred Value : 0.744
Prevalence : 0.700
Detection Rate : 0.646
Detection Prevalence : 0.791
Balanced Accuracy : 0.721
'Positive' Class : Good
GBM Train Results:
Confusion Matrix and Statistics
Reference
Prediction Good Bad
Good 543 16
Bad 18 225
Accuracy : 0.958
95% CI : (0.941, 0.97)
No Information Rate : 0.7
P-Value [Acc > NIR] : <0.0000000000000002
Kappa : 0.899
Mcnemar's Test P-Value : 0.864
Sensitivity : 0.968
Specificity : 0.934
Pos Pred Value : 0.971
Neg Pred Value : 0.926
Prevalence : 0.700
Detection Rate : 0.677
Detection Prevalence : 0.697
Balanced Accuracy : 0.951
'Positive' Class : Good
Random Forest Train Results:
Confusion Matrix and Statistics
Reference
Prediction Good Bad
Good 535 52
Bad 26 189
Accuracy : 0.903
95% CI : (0.88, 0.922)
No Information Rate : 0.7
P-Value [Acc > NIR] : < 0.0000000000000002
Kappa : 0.761
Mcnemar's Test P-Value : 0.00464
Sensitivity : 0.954
Specificity : 0.784
Pos Pred Value : 0.911
Neg Pred Value : 0.879
Prevalence : 0.700
Detection Rate : 0.667
Detection Prevalence : 0.732
Balanced Accuracy : 0.869
'Positive' Class : Good
credit_cost and Precision are in sync.
Train results are best for GBM, but it's overfitting, i.e. its variance is high, so the results on the test set are not as good.
Test results are best for Random Forest. It has less variance than GBM, but its bias is higher.
It may seem like GBM is the better model, but we still haven’t seen the uncertainty (variance) in the results. The difference between train and test set results gives some idea about it, but it's better to see it on cross-validated results.
Model Details:
==============
H2OBinomialModel: gbm
Model ID: gbm_grid_11_model_3
Model Summary:
  number_of_trees  number_of_internal_trees  model_size_in_bytes  min_depth  max_depth  mean_depth  min_leaves  max_leaves  mean_leaves
1              50                        50                16282          5          5     5.00000          12          27     21.26000
H2OBinomialMetrics: gbm
** Reported on training data. **
MSE: 0.04534
RMSE: 0.2129
LogLoss: 0.1929
Mean Per-Class Error: 0.04519
AUC: 0.9927
AUCPR: 0.9953
Gini: 0.9853
R^2: 0.7393
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
Bad Good Error Rate
Bad 225 16 0.066390 =16/241
Good 20 814 0.023981 =20/834
Totals 245 830 0.033488 =36/1075
Maximum Metrics: Maximum metrics at their respective thresholds
                           metric threshold       value idx
1                          max f1  0.565630    0.978365 233
2                          max f2  0.379781    0.986266 274
3                    max f0point5  0.624613    0.985565 212
4                    max accuracy  0.585040    0.966512 227
5                   max precision  0.989265    1.000000   0
6                      max recall  0.321050    1.000000 287
7                 max specificity  0.989265    1.000000   0
8                max absolute_mcc  0.585040    0.906286 227
9      max min_per_class_accuracy  0.603570    0.962656 221
10    max mean_per_class_accuracy  0.624613    0.966521 212
11                        max tns  0.989265  241.000000   0
12                        max fns  0.989265  832.000000   0
13                        max fps  0.020773  241.000000 399
14                        max tps  0.321050  834.000000 287
15                        max tnr  0.989265    1.000000   0
16                        max fnr  0.989265    0.997602   0
17                        max fpr  0.020773    1.000000 399
18                        max tpr  0.321050    1.000000 287
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
H2OBinomialMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 0.1689
RMSE: 0.4109
LogLoss: 0.5094
Mean Per-Class Error: 0.4049
AUC: 0.7906
AUCPR: 0.8921
Gini: 0.5812
R^2: 0.1967
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
Bad Good Error Rate
Bad 54 187 0.775934 =187/241
Good 19 542 0.033868 =19/561
Totals 73 729 0.256858 =206/802
Maximum Metrics: Maximum metrics at their respective thresholds
                           metric threshold       value idx
1                          max f1  0.219487    0.840310 347
2                          max f2  0.109460    0.922619 381
3                    max f0point5  0.606021    0.834918 216
4                    max accuracy  0.443982    0.754364 268
5                   max precision  0.991849    1.000000   0
6                      max recall  0.045964    1.000000 396
7                 max specificity  0.991849    1.000000   0
8                max absolute_mcc  0.606021    0.439563 216
9      max min_per_class_accuracy  0.672135    0.725490 186
10    max mean_per_class_accuracy  0.606021    0.729440 216
11                        max tns  0.991849  241.000000   0
12                        max fns  0.991849  560.000000   0
13                        max fps  0.024140  241.000000 399
14                        max tps  0.045964  561.000000 396
15                        max tnr  0.991849    1.000000   0
16                        max fnr  0.991849    0.998217   0
17                        max fpr  0.024140    1.000000 399
18                        max tpr  0.045964    1.000000 396
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Cross-Validation Metrics Summary:
                  mean          sd  cv_1_valid  cv_2_valid  cv_3_valid  cv_4_valid  cv_5_valid
accuracy     0.7699316 0.041614145  0.78443116   0.7939394   0.7051282   0.8113208   0.7548387
auc          0.7899245 0.036268797   0.7916667   0.8346235        0.75   0.8153495  0.75798285
aucpr       0.88372415  0.02987459  0.89833695  0.91614044   0.8630901   0.8981989   0.8428545
err         0.23006836 0.041614145  0.21556886   0.2060606   0.2948718  0.18867925   0.2451613
err_count         36.8   5.9329586        36.0        34.0        46.0        30.0        38.0
pr_auc      0.88372415  0.02987459  0.89833695  0.91614044   0.8630901   0.8981989   0.8428545
precision   0.77198565  0.04895599   0.7837838   0.7887324  0.69736844  0.83064514   0.7593985
r2          0.19584712  0.08572516  0.20389102  0.28802457  0.07672223   0.2621514   0.1484464
recall       0.9591504 0.029826047  0.96666664   0.9655172         1.0  0.91964287  0.94392526
rmse        0.41076636  0.02622249  0.40124473  0.38554546   0.4484144  0.39196244   0.4266648
specificity 0.33468577  0.17005084  0.31914893   0.3877551        0.08   0.5531915  0.33333334
Model Details:
==============
H2OBinomialModel: drf
Model ID: drf_grid_11_model_4
Model Summary:
  number_of_trees  number_of_internal_trees  model_size_in_bytes  min_depth  max_depth  mean_depth  min_leaves  max_leaves  mean_leaves
1             300                       300               178068          6          6     6.00000          28          55     42.49000
H2OBinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
MSE: 0.167
RMSE: 0.4086
LogLoss: 0.5039
Mean Per-Class Error: 0.2986
AUC: 0.7958
AUCPR: 0.8948
Gini: 0.5916
R^2: 0.2057
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
Bad Good Error Rate
Bad 128 113 0.468880 =113/241
Good 72 489 0.128342 =72/561
Totals 200 602 0.230673 =185/802
Maximum Metrics: Maximum metrics at their respective thresholds
                           metric threshold       value idx
1                          max f1  0.588809    0.840929 274
2                          max f2  0.263705    0.921788 396
3                    max f0point5  0.670701    0.834331 217
4                    max accuracy  0.588809    0.769327 274
5                   max precision  0.969245    1.000000   0
6                      max recall  0.263705    1.000000 396
7                 max specificity  0.969245    1.000000   0
8                max absolute_mcc  0.620096    0.435152 251
9      max min_per_class_accuracy  0.676920    0.729055 213
10    max mean_per_class_accuracy  0.700984    0.734055 192
11                        max tns  0.969245  241.000000   0
12                        max fns  0.969245  560.000000   0
13                        max fps  0.172355  241.000000 399
14                        max tps  0.263705  561.000000 396
15                        max tnr  0.969245    1.000000   0
16                        max fnr  0.969245    0.998217   0
17                        max fpr  0.172355    1.000000 399
18                        max tpr  0.263705    1.000000 396
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
H2OBinomialMetrics: drf
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 0.1673
RMSE: 0.409
LogLoss: 0.5034
Mean Per-Class Error: 0.3413
AUC: 0.7942
AUCPR: 0.8968
Gini: 0.5885
R^2: 0.2041
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
Bad Good Error Rate
Bad 101 140 0.580913 =140/241
Good 57 504 0.101604 =57/561
Totals 158 644 0.245636 =197/802
Maximum Metrics: Maximum metrics at their respective thresholds
                           metric threshold       value idx
1                          max f1  0.563329    0.836515 298
2                          max f2  0.367694    0.922747 382
3                    max f0point5  0.656225    0.834299 231
4                    max accuracy  0.569465    0.754364 293
5                   max precision  0.966395    1.000000   0
6                      max recall  0.309813    1.000000 393
7                 max specificity  0.966395    1.000000   0
8                max absolute_mcc  0.654595    0.436647 233
9      max min_per_class_accuracy  0.677673    0.718360 213
10    max mean_per_class_accuracy  0.656225    0.729425 231
11                        max tns  0.966395  241.000000   0
12                        max fns  0.966395  560.000000   0
13                        max fps  0.224467  241.000000 399
14                        max tps  0.309813  561.000000 393
15                        max tnr  0.966395    1.000000   0
16                        max fnr  0.966395    0.998217   0
17                        max fpr  0.224467    1.000000 399
18                        max tpr  0.309813    1.000000 393
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Cross-Validation Metrics Summary:
                  mean          sd  cv_1_valid  cv_2_valid  cv_3_valid  cv_4_valid  cv_5_valid
accuracy     0.7579888 0.029784564   0.7305389   0.7878788  0.74358976   0.7924528   0.7354839
auc          0.7938621 0.023601508  0.79468083   0.8224842  0.76584905   0.8107903  0.77550626
aucpr        0.8911885 0.020987421   0.9034187   0.9098033  0.86775553   0.9060463   0.8689187
err         0.24201117 0.029784564  0.26946107  0.21212122  0.25641027  0.20754717  0.26451612
err_count         38.8    4.816638        45.0        35.0        40.0        33.0        41.0
pr_auc       0.8911885 0.020987421   0.9034187   0.9098033  0.86775553   0.9060463   0.8689187
precision    0.7727225 0.047828298  0.72727275   0.8292683   0.7619048       0.816   0.7291667
r2          0.20340157 0.024107175  0.19744903  0.22976469  0.17218669  0.22555843  0.19204898
recall       0.9353987  0.05225089         1.0  0.87931037   0.9056604  0.91071427   0.9813084
rmse        0.40912727 0.010530577  0.40286487  0.40100962  0.42459956  0.40156436   0.4155979
specificity   0.342424  0.22247364  0.04255319   0.5714286         0.4   0.5106383      0.1875
Not much difference here either; DRF seems only slightly better, but that may change with the fold assignment. For GBM, I tuned positive-class upsampling but no other hyperparameters, and for DRF I did the exact opposite. So both models leave a lot of scope for tuning, and I am not yet at a stage to pick the right model.
We can look at the feature importance of either GBM or DRF, but DRF gives a cleaner plot because it does not break categorical features into their individual classes, so we will use DRF.
Top-3 features are “checking_account_status”, “duration_in_months”, and “credit_amount”.
To profile a ‘Good’ (creditworthy) person as per the model, let’s explore the relationship of the top predictors with the predicted class for the DRF model.
So, the most creditworthy person would have the following profile:
- checking_account_status is “A14”, i.e. no checking account
- duration_in_months is less than 12 months, i.e. a year
- credit_amount is less than 2k
- credit_history is “A34”, i.e. critical account / other existing credits
- purpose is “A43”, i.e. radio/television
This seems slightly unintuitive, but I would have to go into model explainability to get better insights, and time is currently short for that.